Bormuth Item Writing Techniques

G. H. Roid and Thomas M. Haladyna
Teaching Research Division
Two techniques for writing achievement test items to accompany instruc-
tional materials were contrasted: (a) generating items from statements of
instructional objectives, and (b) generating items from rules for transform-
ing instructional statements (adapted from Bormuth). Items of each type
were written by two experienced item writers. Subjects were given tests
employing these items before and after reading a programmed booklet. One
item writer was found to produce consistently more difficult test items
regardless of the technique used. This result supports the contention that
objective-based item writing results in items of varying quality, but is in
conflict with the hypothesis that the rule-generation technique eliminates
"subjectivity" in item writing. The need for further investigation of
fully automated, linguistic-based rules for item writing is suggested.
A Comparison of Objective-Based and
Bormuth Item-Writing Techniques

G. H. Roid and Thomas M. Haladyna
Teaching Research Division 
Oregon State System of Higher Education 

Bormuth (1970) has suggested that achievement test items should be
constructed by using operationally defined item writing techniques, so that
a precise description of what has been learned and measured can be made.
These techniques involve the algorithmic transformation of sentences from
prose instruction into test items and allow for automated item generation.
Hively (1974) has proposed the concept of item forms for generating domain-
referenced test items, a concept similar to Bormuth's item writing rules.
Anderson (1972), Millman (1974), and others have reiterated the need for
item writing rules, and Anderson has emphasized the importance of insuring
that test items contain wording or examples different from those used in
instruction in order to truly test comprehension.

Bormuth (1970) has contrasted traditional methods of item construction
(as represented by the methods of Bloom, Thorndike and Hagen, etc.) with
item construction using operational definitions or rules. Operational item
writing rules for achievement test items are a series of directions which
tell an item writer how to rearrange segments of the instruction to obtain
the items of that type. A simple example would be "subject deletion" items,
which would be written using a detailed rule summarized as follows: "Inspect
all sentences in the instruction, substituting a 'wh-pro' word such as who,
what, or where for the appropriate subject of each sentence." For instance,
"The boy rode the horse" would be transformed to: "Who rode the horse?"

In contrast to operationally defined techniques, Bormuth (1970, p. 10)
and Millman (1974, p. 325) have suggested that the use of instructional
objectives or other traditional item writing methods allows the item writer
so many options that no two item writers could be expected to produce
comparable tests.
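
The "subject deletion" rule quoted above can be illustrated with a minimal
sketch. It is not Bormuth's algorithm: the function name is hypothetical, and
the sketch assumes that the item writer supplies the subject phrase and that
the subject opens the sentence, which sidesteps the linguistic analysis a
fully operational rule would require.

```python
def subject_deletion(sentence: str, subject: str, wh_word: str = "Who") -> str:
    """Turn a declarative sentence into a wh-question by replacing its subject.

    Hypothetical helper for illustration only; the subject phrase is
    supplied by the caller rather than found by parsing.
    """
    body = sentence.strip().rstrip(".")
    if not body.startswith(subject):
        # Assumption: only sentence-initial subjects are handled here.
        raise ValueError("subject must open the sentence")
    remainder = body[len(subject):].strip()
    return f"{wh_word} {remainder}?"


print(subject_deletion("The boy rode the horse.", "The boy"))
# prints: Who rode the horse?
```

Even in this toy form, the choice of subject phrase and wh-word is left to
the caller, which is the kind of option an operational rule is meant to
remove.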

However, as Cronbach (1970) commented, complete and useful guidelines
for the average test constructor were not provided in the work of Bormuth
(1970). In fact, unless one is a competent linguist versed in transforma-
tional grammar, the implementation of Bormuth's method for tests of prose
learning will be difficult indeed. The complexity of the transformations
required to make sentences into test items is reflected in the 82-step
algorithm posed by Finn (Note 1), a former student of Bormuth's. Also,
Hambleton, et al. (Note 2, p. 15) called for a compromise method intermediate
to automated item generation and the use of objectives, because "it does not
appear likely that the notion of specifying content via the use of item
generation rules will be applicable to many subject areas." For these reasons
and because of the obvious importance of Bormuth's suggestions, the present
study was designed to examine the effectiveness of using rules for item
writing which are similar to, but not as rigorously automated as, those
suggested by Bormuth. The item-writing rules of the present study were used to
transform sentences taken directly from an instructional booklet, and used
keyword deletion and wh-pronoun syntactic transformations as suggested.
However, the rules were written so that most item writers unfamiliar with
linguistics could still implement them.

In addition to a comparison of objective-based and rule-generated item
methods, the present study examined the contention of Anderson (1972) that
paraphrasing of words from the instructional text for use in test items more
adequately assesses comprehension than using verbatim words from the text.
Specifically, the present study tested the hypothesis that items with verbatim
wording would be easier than items with paraphrased wording.

In summary, the study contrasted the following:

1. Items written from statements of the instructional objectives of
a programmed instructional booklet vs. items written from item
writing rules for the same booklet.

2. Items designed to measure verbatim recall vs. comprehension (compre-
hension measured by items which use paraphrasing of instructional
statements or examples not given in the instruction).

3. Items written using each of the above strategies by two different
experienced item writers.

The major question under investigation was "Do different item writers
produce test items which are more similar (statistically) if they use opera-
tional item writing rules than if they use instructional objectives as a

Method 

Subjects. Seventy-two dental students from the University of Oregon
Health Science Center, School of Dentistry served as subjects. They were
all enrolled in a second-year course in Crown and Bridge Techniques. The
experiment was conducted as part of the regular course work and an instruc-
tional booklet used in the experiment was required reading.

Instruments. Four parallel test forms of 48 items each were constructed
to measure learning from the booklet. Items of each type were written by
each of two item writers:²

(a) objective-based items

(b) items generated from rules for transforming instructional statements

(c) verbatim recall (recognition of multiple choice alternative), and

(d) comprehension items written by paraphrasing instructional statements
or using new examples.

²Item writers were the authors, G. H. Roid and Thomas M. Haladyna.

Item writing rules were of two types, keyword deletion and syntactical
transformation. The rules for verbatim items were written as follows:

Rule 1. Verbatim-Deletion. A keyword or phrase is deleted and replaced by
        a blank. The word or phrase is included as one of four alternatives
        in a multiple-choice format. Except for deletion, the exact words
        from the instructional statement being used are retained.

Rule 2. Verbatim-Syntactical Transformation. Also might be called a "wh-
        transformation". A keyword or phrase is deleted from an instructional
        statement and replaced by a phrase such as "which one of the following",
        or "what is...". The keyword or phrase is included as one of four
        alternatives in a multiple-choice format. Except for the addition of
        the "wh-pronoun", the exact words from the instructional statement are
        retained.
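
Rule 1 can be sketched in a few lines of code. The sketch below is an
illustration rather than the procedure the item writers followed: the
function and class names are hypothetical, the sentence is the "boy rode the
horse" example used earlier, and the three foils are invented for the
example. Note that the keyword and the foils are still chosen by the caller.

```python
import random
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    stem: str
    alternatives: list
    answer: str


def verbatim_deletion(statement, keyword, foils, seed=0):
    """Rule 1 sketch: blank out the keyword and offer it among four alternatives."""
    if keyword not in statement:
        raise ValueError("keyword must occur verbatim in the statement")
    if len(foils) != 3:
        raise ValueError("four alternatives are required: the keyword plus three foils")
    # Only the keyword is removed; every other word is retained verbatim.
    stem = statement.replace(keyword, "_____", 1)
    alternatives = [keyword, *foils]
    # Shuffle with a fixed seed so the example is reproducible.
    random.Random(seed).shuffle(alternatives)
    return MultipleChoiceItem(stem, alternatives, keyword)


# Hypothetical foils, invented for this example only.
item = verbatim_deletion("The boy rode the horse.", "horse",
                         ["camel", "bicycle", "wagon"])
print(item.stem)  # prints: The boy rode the _____.
print(item.alternatives)
```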

The same keyword deletion or "wh-transformation" rules were used for
Rules 3 and 4 with the following rules of paraphrase added, as suggested by
Anderson (1972): (a) No substantive terms from the original statement should
remain in the paraphrase including all nouns, verbs, and modifiers, and (b)
The meaning of the paraphrase sentence(s) should be identical to that in the
original. Both the item stem and each foil were paraphrased.
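
Paraphrase rule (a) lends itself to a mechanical check, sketched below under
loud assumptions: the function names and the short list of function words
are hypothetical, words are compared as exact lowercase strings with no
stemming, and rule (b), identity of meaning, is not checked at all. The
paraphrases are invented for the example.

```python
import re

# Assumption: a tiny, illustrative list of non-substantive (function) words.
FUNCTION_WORDS = {"a", "an", "the", "on", "of", "in", "to", "is", "was", "and", "or"}


def content_words(text):
    """Return the lowercase words of `text` that are not function words."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in FUNCTION_WORDS}


def shared_substantive_terms(original, paraphrase):
    """Terms that survive from the original and so violate paraphrase rule (a)."""
    return content_words(original) & content_words(paraphrase)


original = "The boy rode the horse."
print(sorted(shared_substantive_terms(original, "The lad rode the pony.")))
# prints: ['rode']
print(sorted(shared_substantive_terms(original, "A lad travelled on a pony.")))
# prints: []
```

An empty result only shows that no substantive word was carried over; an
inflected form of an original word would pass this check unnoticed.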

All instructionally-relevant sentences in the prose text used in the
experiment were identified and numbered. "Instructionally relevant" sentences
were defined as those which were not simply directions to the student (e.g.,
"Now, let's examine Step 2"). From the relevant sentences, two random sam-
ples were drawn, one for each item writer. An example of one of the sentences
is "Cleanability is a requirement of every pontic." This sentence was trans-
formed using Rule 2, for example, into the following test item:

Which is a requirement of every pontic?
*a. cleanability
 b. appearance
 c. prevention of splinting
 d. none of the above
Twelve instructional objectives were also written for the instructional
booklet and then given to the item writers for use in producing the objective-
based items. Each objective was written so that either a verbatim-recall (VR)
or a paraphrase-comprehension (PC) item could be produced from it. An example
of one of the objectives is:

Given [verbatim (VR) / new (PC)] written descriptions of principles, the
student will identify which of the three [principles (VR) / ...lding
concepts (PC)] is being applied.

Items of each type were assigned randomly to test forms. In each test
form, every other item was a rule-generated item, pairs of verbatim items
alternated with pairs of paraphrase items, and items written by each experi-
menter were intermingled to prevent any sequence or fatigue effects due to
position of items in the test.

Procedures. Students were given a pretest prior to being given the
instructional booklet and then a posttest approximately one month later. A
retention test was given ten weeks after the posttest. Students were given
different forms of the test at each administration. Forms were assigned to
subjects randomly on the basis of alphabetical groupings of students' names,
e.g., all A-D's were given FORM A on the pretest, then FORM B on the posttest.
Alphabetical groupings coincided with seat assignments in the dental labora-
tory classroom where testing was conducted. Groupings numbered N=15, N=18,
N=19, and N=20. All test booklets were collected at the end of each testing
session. Answer keys and results were given to each student as he completed
the posttest and retention test. No feedback was given after the pretest.

Results 

The results of this study indicate some similarities and some striking
differences between items written by different methods. Item analyses revealed
that most items showed "instructional sensitivity" as measured by the difference
between pretest and posttest item difficulties. However, 19.8 percent of the
items did not show instructional sensitivity. For example, some item diffi-
culties were uniform on pretest and posttest or were lower on the posttest
than on the pretest. Importantly, these nonsensitive items were uniformly
distributed among the various item types: objective-based, rule-generated,
verbatim, paraphrase, and items of different writers.
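
The sensitivity index described above can be sketched as a simple difference
score. The sketch assumes that item difficulty is the proportion of students
answering the item correctly (so higher values mean easier items) and that
an item counts as nonsensitive when its posttest difficulty is no higher
than its pretest difficulty. The function names and the five difficulty
values are hypothetical and are not data from the study.

```python
def instructional_sensitivity(pretest, posttest):
    """Posttest minus pretest difficulty for each item (proportion-correct scale)."""
    return [round(post - pre, 4) for pre, post in zip(pretest, posttest)]


def nonsensitive_share(pretest, posttest):
    """Share of items whose difficulty did not rise from pretest to posttest."""
    gains = instructional_sensitivity(pretest, posttest)
    flagged = sum(1 for gain in gains if gain <= 0)
    return flagged / len(gains)


# Hypothetical difficulties for five items, for illustration only.
pre = [0.40, 0.75, 0.90, 0.55, 0.60]
post = [0.85, 0.75, 0.80, 0.90, 0.95]
print(instructional_sensitivity(pre, post))  # [0.45, 0.0, -0.1, 0.35, 0.35]
print(nonsensitive_share(pre, post))         # 0.4
```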

A further analysis of item difficulties was performed using a 2x2x2x4x3
analysis of variance design in which the factors were: (a) objective-based
vs. rule-generated items, (b) verbatim vs. paraphrase items, (c) items of the
first writer vs. the second writer, (d) items assigned to each of the four
test forms, and (e) a repeated-measures factor: pretest, posttest, and reten-
tion tests. The dependent variable in the analysis was item difficulty.
Table 1 summarizes the results of the analysis of variance.
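
The error degrees of freedom for this design can be checked with a little
arithmetic. The sketch assumes that the item is the unit of analysis, that
all 4 x 48 = 192 items enter the analysis, and that the items are spread
evenly over the between-item cells; the variable names are illustrative.

```python
from math import prod

# Between-item factors and their numbers of levels, as described in the text.
between_levels = {"R": 2, "V": 2, "W": 2, "F": 4}
occasions = 3           # pretest, posttest, retention test
items_per_form = 48
forms = 4

n_items = items_per_form * forms              # 192 items in all
cells = prod(between_levels.values())         # 32 between-item cells
items_per_cell = n_items // cells             # 6 items per cell (assumed balanced)
df_error_between = n_items - cells            # 160
df_occasions = occasions - 1                  # 2
df_error_within = df_error_between * df_occasions  # 320

print(n_items, cells, items_per_cell)         # 192 32 6
print(df_error_between, df_error_within)      # 160 320
```

Under these assumptions the error terms carry 160 and 320 degrees of
freedom, which agrees with the "Error Between" and "Error Within" rows of
Table 1.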

Insert Table 1 about here 



A significant difference between rule-generated and objective-based
items was found, and an inspection of the means shows that rule items were
easier overall (mean difficulty .80) than objective-based items (mean diffi-
culty .73). One item writer produced items that were consistently easier (.80),
regardless of the techniques used, while the other writer produced consistently
more difficult items (.72). There was a difference between the four test forms
with mean difficulties of .72, .76, .79, and .80. The expected difference
between test occasions was observed with mean difficulty on the pretest of .66,
posttest .85, and retention test .79. No difference between verbatim and para-
phrase items was observed, nor were there any significant interactions among
the factors in the analysis.

In addition to analyses of the differences in mean item difficulties, an
examination of the variance of item difficulties was conducted. Variances
were computed for each category of item difficulties based on techniques of
item writing or test occasion. Also, variances were computed for each of the
four rules used to write rule-generated items (and four categories of objective-
based items for comparison purposes) to test the notion that item-writing rules
provide a unifying and stable influence on item characteristics. Variances for
each category of item difficulties are presented in Table 2.

Insert Table 2 about here 

As shown in Table 2, rule-generated items showed less variance overall
than objective-based items, and this difference is statistically significant,
F(287, 287) = 1.255, p < .05. However, it should be noted that the composite
variance of item difficulties for Rule 2 across test occasions, which was .0733
as shown in the right-hand column of Table 2, was higher than that of any of
the comparison categories of objective-based items. Hence, the variances of
the categories of rule-generated item difficulties were not consistently lower
than those of all categories of objective-based item difficulties.

Rules 1 and 3, which involve the simple deletion of a keyword or phrase,
were found to have a lower variance (composite variance = .0363) than Rules 2
and 4 (composite variance = .0605), which involve syntactical transformations
of instructional sentences in the writing of items. The difference in the
variance of these two types of rules was statistically significant, F(153, 153)
= 1.66, p < .05.
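
The two variance-ratio tests reported above can be retraced from the
published composite variances. The sketch below only forms the ratios, with
the larger variance in the numerator; it assumes the F statistics were
computed this way, and the function name is illustrative. Because the
variances are taken as printed (four decimal places), the ratios can differ
from the reported values in the last digit.

```python
def variance_ratio(larger, smaller):
    """F statistic for comparing two variances: larger variance over smaller."""
    if smaller <= 0:
        raise ValueError("variances must be positive")
    return larger / smaller


# Objective-based vs. rule-generated composite variances (from Table 2).
overall = variance_ratio(0.0618, 0.0492)
# Rules 2 and 4 (syntactic transformation) vs. Rules 1 and 3 (keyword deletion).
by_rule_type = variance_ratio(0.0605, 0.0363)

print(f"{overall:.3f}")       # 1.256, against the reported 1.255
print(f"{by_rule_type:.2f}")  # 1.67, against the reported 1.66
```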

As would be expected when instruction is effective, the variance of post-
test item difficulties was lower than that of pretest difficulties, with
retention test difficulties having a variance in between the other two.

The variance of item difficulties of all verbatim items was .0563,
nearly identical to the variance of the difficulties of all paraphrase items,
which was .0565.

Discussion 

Neither the writing of items from instructional objectives nor the use
of rules for transforming instructional sentences into test items, as imple-
mented in this study, completely removed subjectivity from item writing. One
of two item writers produced consistently more difficult items than the other
item writer when both used the same objectives and the same rules. Also,
rule-generated items were found to be easier than objective-based items,
regardless of the item writer or test occasion.

Since rule-generated items involved verbatim use or paraphrasing of
sentences directly from the instructional booklet used in the experiment,
it may be that students more easily determined the answers to these items
through syntactic cues or thematic prompts in the sentences or from general
test-wiseness. It would not be correct to say that the rule-generated items
provided more direct evaluations of instructional effectiveness than objective-
based items, because the rule-generated items were easier on the pretest as
well as the posttest.

Rule-generated items were found to have less variability in item diffi-
culties than objective-based items, lending some support to the contention
that rules help to produce a more uniform set of items for evaluating learn-
ing from instructional materials. However, it is essential that low item
variability not be coupled with a lack of instructional sensitivity and this
lack was, in fact, found in the present study.

The finding of no significant difference between the item difficulties
of verbatim and paraphrased items is surprising and in contradiction to the
suggestions of Anderson (1972) and the findings of Bormuth, et al. (1970).
It would be expected from these previous findings that verbatim items should
be easier than paraphrase items and that paraphrase items should be better
evidence for comprehension. Although paraphrasing was done carefully and
thoroughly in the present study, it may still be that the difficult task of
precise and complete paraphrasing of every word in each item left some unde-
termined incompleteness. However, the present findings lend some support to
the findings of Bortnick (Note 3) in a study of nonsense-syllable learning
which provided some evidence for and some evidence against the efficacy of
using semantic-substitute questions to assess comprehension.

Conclusion 

The present study confirms one of the suggestions of Bormuth (1970, p. 10)
and Millman (1974, p. 375) that two item writers using an item-writing
approach such as an objective-based one will not produce items of similar
quality.

Also, the tentative conclusion of the present study is that the use of
rules for transforming instructional sentences into test items will not reduce
subjectivity in item writing or produce items of uniform statistical quality
if the rules are less than completely operationally defined or automated. If
the item writer retains functions such as choosing keywords to be deleted
from sentences, choosing foils for multiple-choice questions, or syntactically
transforming sentences in a nonautomated fashion, the resulting tests will
not necessarily be of higher quality than tests written by traditional methods.
For example, the present study may indicate, as did the study by Richek (Note 4),
that the choice of which keyword to delete from an instructional sentence
during its transformation will affect the resulting item's difficulty. Richek
found that questions eliciting a subject node were easier than questions
eliciting predicate nodes. In the present study, the choice of which keyword
to delete was left to the item writer.

The present study suggests that the development of true domain-referenced
tests using item forms for evaluating learning from prose materials cannot be
done casually, without, perhaps, the somewhat detailed linguistic analyses
suggested by Bormuth (1970, Chap. 3) or detailed, linguistic-based algorithms
such as the 82-step procedure suggested by Finn (Note 1). It should be
cautioned, however, that the present study was not a controlled study com-
paring automated and nonautomated methods, so that this conclusion is somewhat
speculative. Clearly there is a need for more empirical study of this
important research question.
Table 1

Analysis of Variance on Item Difficulties

Source                              df          MS          F

Between Items
  Rule vs. Objective (R)             1     5845.25       5.60*
  Verbatim vs. Paraphrase (V)        1     1307.13       1.25
  Item Writer (W)                    1     8304.06       7.95**
  Test Form (F)                      3     2980.54       2.85*
  RxV                                1      181.59        .17
  RxW                                1     2594.38       2.48
  . . . (a)
  RxVxWxF                            3     2028.13       1.94
  Error Between                    160     1044.15

Within Items
  Test Administration (T)            2    16201.88      99.52**
  TxV                                2      436.77       2.68
                                     2       47.03        .29
  . . . (a)
  TxRxVxWxF                          6       37.04        .23
  Error Within                     320      162.79

*  p < .05
** p < .01

(a) None of the two-way, three-way, and four-way interactions deleted
from this table were significant. The F ratios of the deleted
interactions ranged from .13 to 1.68 with associated probabilities
above .05.
Table 2

The Variance of Item Difficulties on Three
Occasions for Objective-Based and Rule-Generated Items

                                          Test Occasions
Type of Item                  Pretest   Posttest   Retention   Composite

Rule-Generated Items
  Rule 1 - Verbatim            .0568      .0201      .0248       .0362
                               (24)       (24)       (24)        (72)
  Rule 2 - Verbatim            .0906      .0537      .0660       .0733
                               (24)       (24)       (24)        (72)
  Rule 3 - Paraphrase          .0492      .0246      .0296       .0361
                               (24)       (24)       (24)        (72)
  Rule 4 - Paraphrase          .0538      .0261      .0489       .0484
                               (24)       (24)       (24)        (72)
  Composite                    .0628      .0306      .0426       .0492
                               (96)       (96)       (96)        (288)

Objective Items (a)
  Obj. 1-6 - Verbatim          .0719      .0397      .0326       .0540
                               (24)       (24)       (24)        (72)
  Obj. 7-12 - Verbatim         .0719      .0113      .0391       .0571
                               (24)       (24)       (24)        (72)
  Obj. 1-6 - Paraphrase        .0899      .0336      .0586       .0689
                               (24)       (24)       (24)        (72)
  Obj. 7-12 - Paraphrase       .0417      .0573      .0498       .0586
                               (24)       (24)       (24)        (72)
  Composite                    .0704      .0392      .0462       .0618
                               (96)       (96)       (96)        (288)

Composite Totals               .0698      .0350      .0450       .0566
                               (192)      (192)      (192)       (576)

Note: The number of item difficulties used to compute each variance is
shown in parentheses.

(a) Items for objectives 1-6 and those for objectives 7-12 were merged for this
analysis so that the number of items used to calculate each variance for
each group of items would be uniform and comparable to those for each rule.